36 research outputs found

    Declarative Querying For Biological Sequences.

    Full text link
    Life science research labs today manage increasing volumes of sequence data. Much of the data management and querying today is accomplished procedurally using Perl, Python, or Java programs that integrate data from different sources and query tools. The dangers of this procedural approach are well known to the database community-- a) severe limitations on the ability to rapidly express queries and b) inefficient query plans due to the lack of sophisticated optimization tools. This situation is likely to get worse with advances in high-throughput technologies that make it easier to quickly produce vast amounts of sequence data. The need for a declarative and efficient system to manage and query biological sequence data is urgent. To address this need, we designed the Periscope/SQ system. Periscope/SQ extends current relational systems to enable sophisticated queries on sequence data and can optimize and execute these queries efficiently. This thesis describes the problems that need to be solved to make it possible to build the Periscope/SQ system. First, we describe the algebraic framework which forms the backbone of Periscope/SQ. Second, we describe algorithms to construct large scale suffix tree indexes for efficiently answering sequence queries. Third, we describe techniques for selectivity estimation and optimization in the context of queries over biological sequences. Next, we demonstrate how some of the techniques developed for Periscope/SQ can be applied to produce a powerful mining algorithm that we call FLAME. Finally, we describe GeneFinder, a biological application built on top of Periscope/SQ. GeneFinder is currently being used to predict the targets of transcription factors. Today, genomic and proteomic sequences are the most abundantly available source of high-quality biological data. By making it possible to declaratively and efficiently query vast amount of sequence data, Periscope/SQ opens the door to vast improvements in the pace of bioinformatics research.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/55670/2/tatas_1.pd

    A Benchmark for Structured Extractions from Complex Documents

    Full text link
    Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention both in academia and industry. Although recent multi-modal language models have achieved impressive results, we find that existing benchmarks do not reflect the complexity of real documents seen in industry. In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as nested entities, complex templates including tables and multi-column layouts, and diversity of different layouts (templates) within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and three observations: (1) generalizing to new document templates is very challenging, (2) few-shot performance has a lot of headroom, and (3) models struggle with nested fields such as line-items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope this helps the community make progress on these challenging tasks in extracting structured data from visually rich documents

    Practical methods for constructing suffix trees

    Full text link
    Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/47869/1/778_2005_Article_154.pd
    corecore